Quantitative Text Analysis

Lab Session: Week 11

Author

Instructors: Yen-Chieh Liao and Stefan Müller

Published

April 15, 2024

Installation

Practice installing this R wrapper by following the instructions provided, or by consulting the flaiR documentation at https://davidycliao.github.io/flaiR/.

Environment Setup: Python, R, and RStudio

Following Instructions Below:

Step 1: The installation consists of two parts:

  • First, install Python 3.8 or higher (avoid development versions and the very latest release for compatibility reasons).

  • Second, install R 4.2.0 or higher. The official Python reference for flair is at https://flairnlp.github.io/flair/v0.13.1/. In R, our research group provides an R wrapper, flaiR.

System Requirements:

  • Python (>= 3.10)

  • R (>= 4.2.0)

  • RStudio (its GUI lets users adjust and manage the Python environment from within R)

  • Anaconda or Miniconda (highly recommended for managing the Python environment; the conda environment used by RStudio can be changed via Tools ➟ Global Options ➟ Python)

Step 2: Now, install flaiR in RStudio:

install.packages("remotes")
remotes::install_github("davidycliao/flaiR", force = TRUE)

library(flaiR)
#> flaiR: An R Wrapper for Accessing Flair NLP 0.13.0

Notice:

  • On first load, flaiR detects whether you have Python 3.8 or higher. If not, it skips the automatic installation of Python and flair NLP; in that case, you will need to install Python manually and then reload {flaiR}. If a suitable Python is found, {flaiR} automatically installs the flair Python package into the environment your R session is using. If you use {reticulate}, flaiR will typically default to the r-reticulate environment; run py_config() to check which environment is active. Note that flaiR installs flair NLP directly into the Python environment R is using, which can be changed in RStudio via Tools ➟ Global Options ➟ Python. If you run into installation issues, feel free to ask in the Slack channel.

  • We suggest not installing Python, R, and RStudio from the University’s AnyApp platform.
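Before running Step 2, it can help to confirm which Python your system exposes to R. A minimal base-R sketch (the `python3` binary name is an assumption; on Windows the executable may be called `python`):

```r
# Locate the Python binary visible to R before installing flaiR
py <- Sys.which("python3")  # full path, or "" if python3 is not on PATH
if (nzchar(py)) {
  cat("Found:", system2(py, "--version", stdout = TRUE), "\n")
} else {
  cat("No python3 on PATH; install Python >= 3.10 first.\n")
}
```

If this reports an older Python than required, install a newer one before loading {flaiR}, or point RStudio at a different interpreter via Tools ➟ Global Options ➟ Python.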

Word Embeddings

Sentence: a class used to tokenize the input text.

WordEmbeddings: a class used to embed tokenized text.

library(flaiR)
#> flaiR: An R Wrapper for Accessing Flair NLP 0.12.2
Sentence <- flair_data()$Sentence
WordEmbeddings <- flair_embeddings()$WordEmbeddings

Classic Word Embeddings

  • GloVe embeddings are PyTorch vectors of dimensionality 100.

  • For English, Flair provides a few more options. Here, you can use ‘en-glove’ and ‘en-extvec’ with the WordEmbeddings class.

ID                              Language   Embedding
‘en-glove’ (or ‘glove’)         English    GloVe embeddings
‘en-extvec’ (or ‘extvec’)       English    Komninos embeddings
‘en-crawl’ (or ‘crawl’)         English    FastText embeddings over web crawls
‘en-twitter’ (or ‘twitter’)     English    Twitter embeddings
‘en-turian’ (or ‘turian’)       English    Turian embeddings (small)
‘en’ (or ‘en-news’ or ‘news’)   English    FastText embeddings over Wikipedia data

embedding <- WordEmbeddings("glove")
  • Print the class
print(embedding)
WordEmbeddings(
  'glove'
  (embedding): Embedding(400001, 100)
)
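The printed `Embedding(400001, 100)` means the model stores a lookup table with 400,001 vocabulary entries, each mapped to a 100-dimensional vector. A toy sketch of such a lookup in base R (3 made-up words, 4 dimensions, random values; not the actual GloVe table):

```r
# A toy embedding table: 3 words, 4 dimensions (values are made up)
set.seed(1)
vocab <- c("king", "queen", "apple")
emb_table <- matrix(rnorm(length(vocab) * 4), nrow = length(vocab),
                    dimnames = list(vocab, NULL))

# Embedding a token is just a row lookup in the table
lookup <- function(word) emb_table[word, ]
length(lookup("queen"))  # each word maps to a 4-dimensional vector
```

The real WordEmbeddings class works the same way at heart: tokenize, then look up one pre-trained vector per token.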

Tokenize & Embed

# Tokenize the text
sentence <- Sentence("King Queen man woman Paris London apple orange Taiwan Dublin Bamberg")

# Embed the sentence text using the loaded model.
embedding$embed(sentence)
[[1]]
Sentence[11]: "King Queen man woman Paris London apple orange Taiwan Dublin Bamberg"
  • Each token in the sentence is now embedded with its vector from the model; store the vectors in a list.
sen_list <- list()
for (i in seq_along(sentence$tokens)) {
  # store the tensor vectors to numeric vectors
  sen_list[[i]] <- as.vector(sentence$tokens[[i]]$embedding$numpy())
}
  • Extract the token texts into an R character vector.
token_texts <- sapply(sentence$tokens, function(token) token$text)
  • Form the data frame.
sen_df <- do.call(rbind, lapply(sen_list, function(x) t(data.frame(x))))
sen_df <- as.data.frame(sen_df)
rownames(sen_df) <- token_texts
print(sen_df[,1:20])
                V1        V2        V3        V4         V5       V6       V7
King    -0.3230700 -0.876160  0.219770  0.252680  0.2297600  0.73880 -0.37954
Queen   -0.5004500 -0.708260  0.553880  0.673000  0.2248600  0.60281 -0.26194
man      0.3729300  0.385030  0.710860 -0.659110 -0.0010128  0.92715  0.27615
woman    0.5936800  0.448250  0.593200  0.074134  0.1114100  1.27930  0.16656
Paris    0.9260500 -0.228180 -0.255240  0.739970  0.5007200  0.26424  0.40056
London   0.6055300 -0.050886 -0.154610 -0.123270  0.6627000 -0.28506 -0.68844
apple   -0.5985000 -0.463210  0.130010 -0.019576  0.4603000 -0.30180  0.89770
orange  -0.1496900  0.164770 -0.355320 -0.719150  0.6213000  0.74140  0.68959
Taiwan   0.0061832  0.117350  0.535380  0.787290  0.6427700 -0.56057 -0.35941
Dublin  -0.4281400 -0.168970  0.035079  0.133170  0.4115600  1.03810 -0.32697
Bamberg  0.4854000 -0.296800  0.103520 -0.250310  0.4100900  0.45147 -0.08002
               V8       V9        V10       V11       V12      V13      V14
King    -0.353070 -0.84369 -1.1113000 -0.302660  0.331780 -0.25113  0.30448
Queen    0.738720 -0.65383 -0.2160600 -0.338060  0.244980 -0.51497  0.85680
man     -0.056203 -0.24294  0.2463200 -0.184490  0.313980  0.48983  0.09256
woman    0.240700  0.39045  0.3276600 -0.750340  0.350070  0.76057  0.38067
Paris    0.561450  0.17908  0.0504640  0.024095 -0.064805 -0.25491  0.29661
London   0.491350 -0.68924  0.3892600  0.143590 -0.488020  0.15746  0.83178
apple   -0.656340  0.66858 -0.4916400  0.037557 -0.050889  0.64510 -0.53882
orange   0.403710 -0.24239  0.1774000 -0.950790 -0.188870 -0.02344  0.49681
Taiwan  -0.157720  0.97407 -0.1026900 -0.852620 -0.058598  1.19080  0.19279
Dublin   0.333970 -0.16726 -0.0034566 -0.361420 -0.067648 -0.45075  1.43470
Bamberg -0.264430 -0.47231  0.0170920  0.036594 -0.483970 -0.18393  0.68727
              V15        V16       V17       V18       V19      V20
King    -0.077491 -0.8981500  0.092496 -1.140700 -0.583240  0.66869
Queen   -0.371990 -0.5882400  0.306370 -0.306680 -0.218700  0.78369
man      0.329580  0.1505600  0.573170 -0.185290 -0.522770  0.46191
woman    0.175170  0.0317910  0.468490 -0.216530 -0.462820  0.39967
Paris   -0.476020  0.2424400 -0.067045 -0.460290 -0.384060 -0.36540
London  -0.279230  0.0094755 -0.112070 -0.520990 -0.371590 -0.37951
apple   -0.376500 -0.0431200  0.513840  0.177830  0.285960  0.92063
orange   0.081903 -0.3694400  1.225700 -0.119000  0.955710 -0.19501
Taiwan  -0.266930 -0.7671900  0.681310 -0.240430 -0.086499 -0.18486
Dublin  -0.591370 -0.3136400  0.602490  0.145310 -0.351880  0.18191
Bamberg  0.249500  0.2045100  0.517300  0.084214 -0.115300 -0.53820
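Once each word is a numeric vector (a row of `sen_df`), semantic similarity can be measured directly, most commonly with cosine similarity. A self-contained sketch with toy 3-dimensional vectors (the values are made up; the real GloVe rows have 100 dimensions):

```r
# Cosine similarity: dot product divided by the product of vector norms
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy vectors standing in for rows of sen_df
king  <- c(0.5, 1.2, -0.3)
queen <- c(0.4, 1.1, -0.2)
apple <- c(-1.0, 0.1, 0.9)

cosine(king, queen)  # close to 1: similar words
cosine(king, apple)  # much lower: dissimilar words
```

Applied to the actual rows of `sen_df`, e.g. `cosine(as.numeric(sen_df["King", ]), as.numeric(sen_df["Queen", ]))`, this gives the similarity the PCA plot below visualizes in lower dimensions.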

Dimension Reduction (PCA)

# Set the seed for reproducibility
set.seed(123)

# Execute PCA
pca_result <- prcomp(sen_df, center = TRUE, scale. = TRUE)
word_embeddings_matrix <- as.data.frame(pca_result$x[, 1:3])
rownames(word_embeddings_matrix) <- token_texts
word_embeddings_matrix
               PC1       PC2         PC3
King    -2.9120910  1.285200 -1.95053854
Queen   -2.2413804  2.266714 -1.09020972
man     -5.6381902  2.984461  3.55462010
woman   -6.4891003  2.458607  3.56693660
Paris    3.0702212  5.039061 -2.65962020
London   5.3196216  4.368433 -2.60726627
apple    0.3362535 -8.679358 -0.44752722
orange  -0.0485467 -4.404101  0.77151480
Taiwan  -2.7993829 -4.149287 -6.33296039
Dublin   5.8994096  1.063291 -0.09271925
Bamberg  5.5031854 -2.233020  7.28777009
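Keeping only the first three components discards information, so it is worth checking how much variance they capture; `summary(pca_result)` reports this for the embedding matrix above. A self-contained sketch on random data of the same shape (11 "words" by 20 dimensions; the data here are simulated, not the GloVe vectors):

```r
set.seed(123)
toy <- matrix(rnorm(11 * 20), nrow = 11)  # 11 "words", 20 dimensions
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cat("Cumulative variance of first 3 PCs:",
    round(sum(var_explained[1:3]), 2), "\n")
```

If the first three components explain only a modest share of the variance, distances in the 2D and 3D plots below should be read as a rough sketch of the full 100-dimensional geometry.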

2D Plot

library(ggplot2)
plot2D <- ggplot(word_embeddings_matrix, aes(x = PC1, y = PC2, color = PC3, 
                                             label = rownames(word_embeddings_matrix))) +
  geom_point(size = 3) + 
  geom_text(vjust = 1.5, hjust = 0.5) +  
  scale_color_gradient(low = "blue", high = "red") + 
  theme_minimal() +  
  labs(title = "", x = "PC1", y = "PC2", color = "PC3") 
  # guides(color = "none")  
plot2D

Bonus: 3D Plot

plotly in R API: https://plotly.com/r/

library(plotly)
plot3D <- plot_ly(data = word_embeddings_matrix, 
                  x = ~PC1, y = ~PC2, z = ~PC3, 
                  type = "scatter3d", mode = "markers",
                  marker = list(size = 5), 
                  text = rownames(word_embeddings_matrix), hoverinfo = 'text')

plot3D

Tasks

  • Use the script provided above with a different word embedding model and perform the PCA again. What differences emerge?

  • Create a new sentence object, extract the vectors, and perform PCA again. Then compare the results across models.